Language modeling by variable length sequences: theoretical formulation and evaluation of multigrams

Authors

  • Sabine Deligne
  • Frédéric Bimbot
Abstract

The multigram model assumes that language can be described as the output of a memoryless source that emits variable-length sequences of words. The estimation of the model parameters can be formulated as a Maximum Likelihood estimation problem from incomplete data. We show that estimates of the model parameters can be computed through an iterative Expectation-Maximization algorithm, and we describe a forward-backward procedure for its implementation. We report the results of a systematic evaluation of multigrams for language modeling on the ATIS database. The objective performance measure is the test set perplexity. Our results show that multigrams outperform conventional n-grams for this task.
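
As an illustration of the model described in the abstract, the following is a minimal sketch of a forward recursion that sums the probabilities of all segmentations of a sentence into variable-length word sequences. It assumes that sequences are emitted independently (the memoryless source) and that sequence probabilities are already available. The function name `forward_likelihood`, the `seq_probs` dictionary, the `max_len` bound, and the toy probabilities are assumptions made for this example and are not taken from the paper.

```python
def forward_likelihood(words, seq_probs, max_len):
    """Likelihood of `words` summed over all segmentations into sequences
    of at most `max_len` words, each sequence drawn independently.

    `seq_probs` maps a tuple of words to its (hypothetical) probability.
    """
    n = len(words)
    # alpha[t] = total probability of all segmentations of the first t words
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for t in range(1, n + 1):
        # the last sequence of the prefix covers words[t - length:t]
        for length in range(1, min(max_len, t) + 1):
            seq = tuple(words[t - length:t])
            alpha[t] += alpha[t - length] * seq_probs.get(seq, 0.0)
    return alpha[n]


# Toy, hand-picked probabilities (illustrative only)
toy_probs = {("a",): 0.5, ("b",): 0.3, ("a", "b"): 0.2}
likelihood = forward_likelihood(["a", "b"], toy_probs, max_len=2)
# Two segmentations contribute: [a][b] -> 0.5 * 0.3 and [a b] -> 0.2
```

In an Expectation-Maximization setting, a matching backward pass over the same lattice would give the posterior weight of each sequence occurrence, from which the sequence probabilities could be re-estimated. That step is omitted here.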

Similar resources

Multigrams for language identification

In this paper we present two new approaches to language identification. Both are based on so-called multigrams, an information-theoretic representation of observations. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment the new observation into larger units (e.g. something like ...

Inference of variable-length linguistic and acoustic units by multigrams

The efficiency of pattern recognition algorithms depends heavily on a proper definition of the patterns assumed to structure the data. The multigram model provides a statistical tool to retrieve sequential variable-length regularities within streams of data. In this paper, we present a general formulation of the model, applicable to single or multiple parallel strings of data having eithe...

Speech spectrum representation and coding using multigrams with distance

Multigrams allow us to split a string of symbols into a stream of variable-length sequences. Because the direct application of this method to vector-quantized speech spectra fails, we develop an extension of the method called modified multigrams, or multigrams with distance. The algorithm for modified multigram dictionary training as well as experimental results are presented. We found a significant ...

A New Finite Element Formulation for Buckling and Free Vibration Analysis of Timoshenko Beams on Variable Elastic Foundation

In this study, the buckling and free vibration of Timoshenko beams resting on a variable elastic foundation are analyzed by means of a new finite element formulation. The Winkler model has been applied for the elastic foundation. A two-node element with four degrees of freedom is suggested for the finite element formulation. Displacement and rotational fields are approximated by cubic and quadratic polynomia...

Publication date: 1995